Patchwork: A Patch-wise Attention Network for Efficient Object Detection and Segmentation in Video Streams
Recent advances in single-frame object detection and segmentation techniques
have motivated a wide range of works to extend these methods to process video
streams. In this paper, we explore the idea of hard attention aimed at
latency-sensitive applications. Instead of reasoning about every frame
separately, our method selects and only processes a small sub-window of the
frame. Our technique then makes predictions for the full frame based on the
sub-windows from previous frames and the update from the current sub-window.
The latency reduction by this hard attention mechanism comes at the cost of
degraded accuracy. We make two contributions to address this. First, we propose
a specialized memory cell that recovers lost context when processing
sub-windows. Second, we adopt a Q-learning-based policy training strategy
that enables our approach to intelligently select the sub-windows such that the
staleness in the memory hurts performance the least. Our experiments
suggest that our approach reduces latency by approximately a factor of four
without significantly sacrificing accuracy on the ImageNet VID video object
detection dataset and the DAVIS video object segmentation dataset. We further
demonstrate that we can reinvest the saved computation into other parts of the
network, resulting in an accuracy increase at a computational cost comparable
to the original system, and beating other recently proposed state-of-the-art
methods in the low-latency range.
Comment: ICCV 2019 Camera Ready + Supplementary
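For intuition, the control flow might look like the following minimal sketch, assuming a fixed grid of candidate sub-windows and toy stand-in modules (a Q-value head, a GRU as the memory, pooled conv features) that are not the paper's actual architecture:

```python
import torch
import torch.nn as nn

class PatchworkSketch(nn.Module):
    def __init__(self, num_windows=16, feat_dim=128):
        super().__init__()
        self.backbone = nn.Conv2d(3, feat_dim, 3, padding=1)  # toy crop encoder
        self.memory = nn.GRUCell(feat_dim, feat_dim)          # stand-in for the specialized memory cell
        self.q_head = nn.Linear(feat_dim, num_windows)        # Q-value per candidate sub-window
        self.det_head = nn.Linear(feat_dim, 4)                # toy full-frame prediction head

    def forward(self, frame, state, windows):
        # Greedy action from the Q-head (epsilon-greedy exploration during training).
        action = self.q_head(state).argmax(dim=-1).item()
        y0, x0, h, w = windows[action]
        crop = frame[:, :, y0:y0 + h, x0:x0 + w]              # only this sub-window is processed
        feat = self.backbone(crop).mean(dim=(2, 3))           # pooled crop features
        state = self.memory(feat, state)                      # refresh the (possibly stale) context
        return self.det_head(state), state

windows = [(y, x, 32, 32) for y in range(0, 128, 32) for x in range(0, 128, 32)]
model = PatchworkSketch(num_windows=len(windows))
state = torch.zeros(1, 128)
for frame in torch.randn(5, 1, 3, 128, 128):                  # 5-frame toy stream
    pred, state = model(frame, state, windows)
```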
FEELVOS: Fast End-to-End Embedding Learning for Video Object Segmentation
Many of the recent successful methods for video object segmentation (VOS) are
overly complicated, rely heavily on fine-tuning on the first frame, and/or are
slow, and are hence of limited practical use. In this work, we propose FEELVOS
as a simple and fast method which does not rely on fine-tuning. In order to
segment a video, for each frame FEELVOS uses a semantic pixel-wise embedding
together with a global and a local matching mechanism to transfer information
from the first frame and from the previous frame of the video to the current
frame. In contrast to previous work, our embedding is used only as internal
guidance for a convolutional network. Our novel dynamic segmentation head allows
us to train the network, including the embedding, end-to-end for the multiple
object segmentation task with a cross-entropy loss. We achieve a new state of
the art in video object segmentation without fine-tuning, with a J&F measure of
71.5% on the DAVIS 2017 validation set. We make our code and models available
at https://github.com/tensorflow/models/tree/master/research/feelvos.
Comment: CVPR 2019 camera-ready version
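A minimal sketch of the global matching step, assuming flattened embedding maps and a nearest-neighbor distance squashed to [0, 1); local matching would simply restrict the reference pixels to a window around each location. The released code linked above is the authoritative implementation:

```python
import torch

def global_matching(curr_emb, ref_emb, ref_mask):
    """curr_emb: (H*W, C) current-frame embeddings; ref_emb: (H*W, C) first-frame
    embeddings; ref_mask: (H*W,) bool mask of one object's first-frame pixels.
    Returns a (H*W,) distance map used as guidance for the segmentation head."""
    obj = ref_emb[ref_mask]                        # embeddings of the object's pixels
    d = torch.cdist(curr_emb, obj)                 # pairwise embedding distances
    nearest = d.min(dim=1).values                  # nearest-neighbor distance per pixel
    return 1.0 - 2.0 / (1.0 + torch.exp(nearest))  # squash to [0, 1)

H, W, C = 16, 16, 32
curr = torch.randn(H * W, C)
ref = torch.randn(H * W, C)
mask = torch.zeros(H * W, dtype=torch.bool)
mask[:40] = True                                   # toy object occupying 40 pixels
dist_map = global_matching(curr, ref, mask).reshape(H, W)
```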
MultiPath: Multiple Probabilistic Anchor Trajectory Hypotheses for Behavior Prediction
Predicting human behavior is a difficult and crucial task required for motion
planning. It is challenging in large part due to the highly uncertain and
multi-modal set of possible outcomes in real-world domains such as autonomous
driving. Beyond predicting a single maximum a posteriori (MAP) trajectory,
obtaining an accurate probability distribution over possible futures is an
area of active interest. We
present MultiPath, which leverages a fixed set of future state-sequence anchors
that correspond to modes of the trajectory distribution. At inference, our
model predicts a discrete distribution over the anchors and, for each anchor,
regresses offsets from anchor waypoints along with uncertainties, yielding a
Gaussian mixture at each time step. Our model is efficient, requiring only one
forward inference pass to obtain multi-modal future distributions, and the
output is parametric, allowing compact communication and analytical
probabilistic queries. We show on several datasets that our model achieves more
accurate predictions, and compared to sampling baselines, does so with an order
of magnitude fewer trajectories.
Comment: Appears in CoRL 2019
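Because the output is parametric, probabilistic queries are analytical. A rough numpy sketch of such a head's outputs, with random placeholder anchors (the paper derives anchors from the data rather than at random):

```python
import numpy as np
from scipy.stats import norm

K, T = 3, 5                                    # number of anchors, timesteps
anchors = np.random.randn(K, T, 2)             # fixed anchor waypoints (x, y)
logits = np.random.randn(K)                    # model output: anchor scores
offsets = 0.1 * np.random.randn(K, T, 2)       # model output: per-waypoint offsets
sigmas = np.full((K, T, 2), 0.5)               # model output: per-waypoint stds

probs = np.exp(logits) / np.exp(logits).sum()  # discrete distribution over anchors
means = anchors + offsets                      # Gaussian component means

def density(xy, t):
    """Analytical mixture density of the agent being at position xy at timestep t."""
    comp = norm.pdf(xy, loc=means[:, t], scale=sigmas[:, t]).prod(axis=-1)
    return float((probs * comp).sum())

print(density(np.array([0.0, 0.0]), t=2))
```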
Pseudo-labeling for Scalable 3D Object Detection
To safely deploy autonomous vehicles, onboard perception systems must work
reliably at high accuracy across a diverse set of environments and geographies.
One of the most common techniques to improve the efficacy of such systems in
new domains involves collecting large labeled datasets, but such datasets can
be extremely costly to obtain, especially if each new deployment geography
requires additional data with expensive 3D bounding box annotations. We
demonstrate that pseudo-labeling for 3D object detection is an effective way to
exploit less expensive and more widely available unlabeled data, and can lead
to performance gains across various architectures, data augmentation
strategies, and sizes of the labeled dataset. Overall, we show that better
teacher models lead to better student models, and that we can distill expensive
teachers into efficient, simple students.
Specifically, we demonstrate that pseudo-label-trained student models can
outperform supervised models trained on 3 to 10 times as many labeled
examples. Using PointPillars [24], a two-year-old architecture, as our student
model, we are able to achieve state of the art accuracy simply by leveraging
large quantities of pseudo-labeled data. Lastly, we show that these student
models generalize better than supervised models to a new domain in which we
only have unlabeled data, making pseudo-label training an effective form of
unsupervised domain adaptation.
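The recipe itself is generic. Here is a toy, runnable illustration with sklearn classifiers standing in for the 3D detectors (the threshold and model choices are illustrative only):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_l = rng.normal(size=(200, 8)); y_l = (X_l[:, 0] > 0).astype(int)  # small labeled set
X_u = rng.normal(size=(2000, 8))                                     # large unlabeled pool

teacher = RandomForestClassifier(n_estimators=100).fit(X_l, y_l)     # strong teacher
conf = teacher.predict_proba(X_u).max(axis=1)
keep = conf >= 0.8                                                   # confidence filtering
X_p, y_p = X_u[keep], teacher.predict(X_u[keep])                     # pseudo-labels

student = LogisticRegression().fit(                                  # simple, efficient student
    np.vstack([X_l, X_p]), np.concatenate([y_l, y_p]))
```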
SurfelGAN: Synthesizing Realistic Sensor Data for Autonomous Driving
Autonomous driving system development is critically dependent on the ability
to replay complex and diverse traffic scenarios in simulation. In such
scenarios, the ability to accurately simulate the vehicle sensors such as
cameras, lidar or radar is essential. However, current sensor simulators
leverage gaming engines such as Unreal or Unity, requiring manual creation of
environments, objects and material properties. Such approaches have limited
scalability and fail to produce realistic approximations of camera, lidar, and
radar data without significant additional work.
In this paper, we present a simple yet effective approach to generate
realistic scenario sensor data, based only on a limited amount of lidar and
camera data collected by an autonomous vehicle. Our approach uses
texture-mapped surfels to efficiently reconstruct the scene from an initial
vehicle pass or set of passes, preserving rich information about object 3D
geometry and appearance, as well as the scene conditions. We then leverage a
SurfelGAN network to reconstruct realistic camera images for novel positions
and orientations of the self-driving vehicle and moving objects in the scene.
We demonstrate our approach on the Waymo Open Dataset and show that it can
synthesize realistic camera data for simulated scenarios. We also create a
novel dataset that contains cases in which two self-driving vehicles observe
the same scene at the same time. We use this dataset to provide additional
evaluation and demonstrate the usefulness of our SurfelGAN model.
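At a high level, the pipeline is a renderer followed by a conditional GAN. A heavily simplified sketch with toy networks, where the surfel rendering is a random placeholder tensor rather than an actual reconstruction:

```python
import torch
import torch.nn as nn

generator = nn.Sequential(                      # surfel render -> realistic image
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1), nn.Tanh())
discriminator = nn.Sequential(                  # real vs. synthesized
    nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
    nn.Flatten(), nn.LazyLinear(1))

surfel_render = torch.rand(1, 3, 64, 64)        # placeholder for a rendered surfel image
real_image = torch.rand(1, 3, 64, 64)           # paired real camera image

fake = generator(surfel_render)
bce = nn.BCEWithLogitsLoss()
d_loss = (bce(discriminator(real_image), torch.ones(1, 1)) +
          bce(discriminator(fake.detach()), torch.zeros(1, 1)))
g_loss = bce(discriminator(fake), torch.ones(1, 1))  # adversarial term for the generator
```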
To the Point: Efficient 3D Object Detection in the Range Image with Graph Convolution Kernels
3D object detection is vital for many robotics applications. For tasks where
a 2D perspective range image exists, we propose to learn a 3D representation
directly from this range image view. To this end, we designed a 2D
convolutional network architecture that carries the 3D spherical coordinates of
each pixel throughout the network. Its layers can consume an arbitrary
convolution kernel in place of the default inner-product kernel and exploit the
underlying local geometry around each pixel. We outline four such kernels: a
dense kernel according to the bag-of-words paradigm, and three graph kernels
inspired by recent graph neural network advances: the Transformer, the
PointNet, and the Edge Convolution. We also explore cross-modality fusion with
the camera image, facilitated by operating in the perspective range image view.
Our method performs competitively on the Waymo Open Dataset and improves the
state-of-the-art AP for pedestrian detection from 69.7% to 75.5%. It is also
efficient: our smallest model, which still outperforms the popular
PointPillars in quality, requires 180 times fewer FLOPs and model parameters.
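As a sketch of the kernel-swap idea, here is an edge-convolution-style layer over a range image that consumes each pixel's neighbors together with their relative 3D coordinates; the unfold-based gathering and single-layer MLP are illustrative choices, not the paper's implementation:

```python
import torch
import torch.nn as nn

class EdgeConvKernel(nn.Module):
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.k = k
        self.mlp = nn.Linear(c_in + 3, c_out)  # neighbor feature + relative xyz -> output

    def forward(self, feats, xyz):
        """feats: (B, C, H, W) pixel features; xyz: (B, 3, H, W) per-pixel 3D coordinates."""
        pad = self.k // 2
        B, _, H, W = feats.shape
        N = H * W
        f = nn.functional.unfold(feats, self.k, padding=pad).view(B, -1, self.k**2, N)
        p = nn.functional.unfold(xyz, self.k, padding=pad).view(B, 3, self.k**2, N)
        rel = p - xyz.reshape(B, 3, 1, N)                    # neighbor coords relative to the center
        x = torch.cat([f, rel], dim=1).permute(0, 3, 2, 1)   # (B, N, k*k, C+3)
        out = self.mlp(x).max(dim=2).values                  # max-pool over the neighborhood
        return out.permute(0, 2, 1).reshape(B, -1, H, W)

layer = EdgeConvKernel(c_in=16, c_out=32)
y = layer(torch.randn(2, 16, 8, 8), torch.randn(2, 3, 8, 8))  # -> (2, 32, 8, 8)
```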
SoDA: Multi-Object Tracking with Soft Data Association
Robust multi-object tracking (MOT) is a prerequisite for the safe deployment of
self-driving cars. Tracking objects, however, remains a highly challenging
problem, especially in cluttered autonomous driving scenes in which objects
tend to interact with each other in complex ways and frequently get occluded.
We propose a novel approach to MOT that uses attention to compute track
embeddings that encode the spatiotemporal dependencies between observed
objects. This attention measurement encoding allows our model to relax the hard
data associations that may otherwise lead to unrecoverable errors. Instead, our
model aggregates information from all object detections via soft data associations.
The resulting latent space representation allows our model to learn to reason
about occlusions in a holistic data-driven way and maintain track estimates for
objects even when they are occluded. Our experimental results on the Waymo
Open Dataset suggest that our approach leverages modern large-scale datasets and
performs favorably compared to the state of the art in visual multi-object
tracking.
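The core of soft data association can be sketched in a few lines: rather than a hard one-to-one assignment, each track attends over all detections with softmax weights, keeping the update differentiable. The single dot-product attention below is a stand-in for the paper's full attention encoder:

```python
import torch
import torch.nn.functional as F

def soft_associate(tracks, detections):
    """tracks: (M, C) track embeddings; detections: (N, C) detection embeddings.
    Returns (M, C) updates: a differentiable, softly weighted blend of detections."""
    scale = tracks.shape[-1] ** 0.5
    attn = F.softmax(tracks @ detections.T / scale, dim=-1)  # (M, N) soft association weights
    return attn @ detections                                 # aggregated detection evidence

tracks = torch.randn(4, 64)   # 4 ongoing tracks
dets = torch.randn(7, 64)     # 7 detections in the current frame
updates = soft_associate(tracks, dets)
```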
TNT: Target-driveN Trajectory Prediction
Predicting the future behavior of moving agents is essential for real world
applications. It is challenging as the intent of the agent and the
corresponding behavior are unknown and intrinsically multimodal. Our key insight
is that for prediction within a moderate time horizon, the future modes can be
effectively captured by a set of target states. This leads to our target-driven
trajectory prediction (TNT) framework. TNT has three stages which are trained
end-to-end. It first predicts an agent's potential target states T steps into
the future by encoding its interactions with the environment and the other
agents. TNT then generates trajectory state sequences conditioned on targets. A
final stage estimates trajectory likelihoods, and a compact set of trajectory
predictions is selected. This contrasts with previous work, which
models agent intents as latent variables, and relies on test-time sampling to
generate diverse trajectories. We benchmark TNT on trajectory prediction of
vehicles and pedestrians, where we outperform state-of-the-art on Argoverse
Forecasting, INTERACTION, Stanford Drone and an in-house
Pedestrian-at-Intersection dataset.
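A minimal sketch of the three stages with toy linear modules (the candidate targets, dimensions, and top-k sizes are all placeholders; the paper trains the stages end-to-end with richer encoders):

```python
import torch
import torch.nn as nn

C, T, K = 64, 12, 6                      # context dim, horizon, targets kept
ctx = torch.randn(1, C)                  # encoded agent + scene context
cands = torch.randn(50, 2)               # candidate target states (e.g., map-sampled)

target_scorer = nn.Linear(C + 2, 1)      # stage 1: score each candidate target
traj_decoder = nn.Linear(C + 2, T * 2)   # stage 2: trajectory given a target
traj_scorer = nn.Linear(C + T * 2, 1)    # stage 3: likelihood of a full trajectory

x = torch.cat([ctx.expand(len(cands), C), cands], dim=-1)
top = target_scorer(x).squeeze(-1).topk(K).indices           # keep the best targets
trajs = traj_decoder(x[top]).view(K, T, 2)                   # one trajectory per target
scores = traj_scorer(torch.cat([ctx.expand(K, C),
                                trajs.reshape(K, -1)], dim=-1)).squeeze(-1)
final = trajs[scores.topk(3).indices]                        # compact final set
```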
StarNet: Targeted Computation for Object Detection in Point Clouds
Detecting objects from LiDAR point clouds is an important component of
self-driving car technology, as LiDAR provides high-resolution spatial
information. Previous work on point-cloud 3D object detection has re-purposed
convolutional approaches from traditional camera imagery. In this work, we
present an object detection system called StarNet designed specifically to take
advantage of the sparse and 3D nature of point cloud data. StarNet is entirely
point-based, uses no global information, has data dependent anchors, and uses
sampling instead of learned region proposals. We demonstrate how this design
leads to competitive or superior performance on the large Waymo Open Dataset
and the KITTI detection dataset, as compared to convolutional baselines. In
particular, we show how our detector can outperform a competitive baseline on
Pedestrian detection on the Waymo Open Dataset by more than 7 absolute mAP
while being more computationally efficient. We show how our redesign---namely
using only local information and using sampling instead of learned
proposals---leads to a significantly more flexible and adaptable system: we
demonstrate how we can vary the computational cost of a single trained StarNet
without retraining, and how we can target proposals towards areas of interest
with priors and heuristics. Finally, we show how our design allows for
incorporating temporal context by using detections from previous frames to
target computation of the detector, which leads to further improvements in
performance without additional computational cost.
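A rough sketch of the targeted-computation idea: sample proposal centers directly from the cloud and process only local neighborhoods around them. The naive farthest point sampling and radius here are illustrative, and the priors or temporal targeting described above would simply change how centers are chosen:

```python
import torch

def farthest_point_sample(pts, n):
    idx = [0]
    d = torch.full((len(pts),), float('inf'))
    for _ in range(n - 1):
        d = torch.minimum(d, (pts - pts[idx[-1]]).norm(dim=1))
        idx.append(int(d.argmax()))          # next center = farthest remaining point
    return pts[idx]

points = torch.randn(2000, 3)                # LiDAR point cloud
centers = farthest_point_sample(points, 64)  # data-dependent anchors
for c in centers:                            # detector budget = number of centers
    nbr = points[(points - c).norm(dim=1) < 2.0]  # only this local neighborhood is processed
    # a per-neighborhood, PointNet-style classifier/box regressor would run here
```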
Scalability in Perception for Autonomous Driving: Waymo Open Dataset
The research community has increasing interest in autonomous driving
research, despite the resource intensity of obtaining representative real-world
data. Existing self-driving datasets are limited in the scale and variation of
the environments they capture, even though generalization within and between
operating regions is crucial to the overall viability of the technology. In an
effort to help align the research community's contributions with real-world
self-driving problems, we introduce a new large scale, high quality, diverse
dataset. Our new dataset consists of 1150 scenes, each spanning 20 seconds of
well-synchronized and calibrated high-quality LiDAR and camera data captured
across a range of urban and suburban geographies. It is 15x more
diverse than the largest camera+LiDAR dataset available based on our proposed
diversity metric. We exhaustively annotated this data with 2D (camera image)
and 3D (LiDAR) bounding boxes, with consistent identifiers across frames.
Finally, we provide strong baselines for 2D as well as 3D detection and
tracking tasks. We further study the effects of dataset size and generalization
across geographies on 3D detection methods. Find data, code, and more up-to-date
information at http://www.waymo.com/open.
Comment: CVPR 2020
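For reference, reading the released data with the public waymo-open-dataset package looks roughly like the sketch below (based on the public tutorial; the segment filename is a placeholder):

```python
import tensorflow as tf
from waymo_open_dataset import dataset_pb2 as open_dataset

# Each scene is stored as a TFRecord of Frame protos holding the synchronized
# LiDAR and camera data plus the 2D/3D labels.
dataset = tf.data.TFRecordDataset('segment-XXXX.tfrecord', compression_type='')
for data in dataset:
    frame = open_dataset.Frame()
    frame.ParseFromString(bytearray(data.numpy()))
    print(frame.context.name,       # unique scene identifier
          len(frame.images),        # camera images in this frame
          len(frame.laser_labels))  # 3D bounding box labels
    break
```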